belief mdp
Active Measuring in Reinforcement Learning With Delayed Negative Effects
Gao, Daiqi, Xu, Ziping, Rawashdeh, Aseel, Klasnja, Predrag, Murphy, Susan A.
Measuring states in reinforcement learning (RL) can be costly in real-world settings and may negatively influence future outcomes. We introduce the Actively Observable Markov Decision Process (AOMDP), where an agent not only selects control actions but also decides whether to measure the latent state. The measurement action reveals the true latent state but may have a negative delayed effect on the environment. We show that this reduced uncertainty may provably improve sample efficiency and increase the value of the optimal policy despite these costs. We formulate an AOMDP as a periodic partially observable MDP and propose an online RL algorithm based on belief states. To approximate the belief states, we further propose a sequential Monte Carlo method to jointly approximate the posterior of unknown static environment parameters and unobserved latent states. We evaluate the proposed algorithm in a digital health application, where the agent decides when to deliver digital interventions and when to assess users' health status through surveys.
Control Synthesis in Partially Observable Environments for Complex Perception-Related Objectives
Perception-related tasks often arise in autonomous systems operating under partial observability. This work studies the problem of synthesizing optimal policies for complex perception-related objectives in environments modeled by partially observable Markov decision processes. To formally specify such objectives, we introduce \emph{co-safe linear inequality temporal logic} (sc-iLTL), which can define complex tasks that are formed by the logical concatenation of atomic propositions as linear inequalities on the belief space of the POMDPs. Our solution to the control synthesis problem is to transform the \mbox{sc-iLTL} objectives into reachability objectives by constructing the product of the belief MDP and a deterministic finite automaton built from the sc-iLTL objective. To overcome the scalability challenge due to the product, we introduce a Monte Carlo Tree Search (MCTS) method that converges in probability to the optimal policy. Finally, a drone-probing case study demonstrates the applicability of our method.
Convex Is Back: Solving Belief MDPs With Convexity-Informed Deep Reinforcement Learning
Koutas, Daniel, Hettegger, Daniel, Papakonstantinou, Kostas G., Straub, Daniel
We present a novel method for Deep Reinforcement Learning (DRL), incorporating the convex property of the value function over the belief space in Partially Observable Markov Decision Processes (POMDPs). We introduce hard- and soft-enforced convexity as two different approaches, and compare their performance against standard DRL on two well-known POMDP environments, namely the Tiger and FieldVisionRockSample problems. Our findings show that including the convexity feature can substantially increase performance of the agents, as well as increase robustness over the hyperparameter space, especially when testing on out-of-distribution domains. The source code for this work can be found at https://github.com/Dakout/Convex_DRL.
Value of Information and Reward Specification in Active Inference and POMDPs
Expected free energy (EFE) is a central quantity in active inference which has recently gained popularity due to its intuitive decomposition of the expected value of control into a pragmatic and an epistemic component. While numerous conjectures have been made to justify EFE as a decision making objective function, the most widely accepted is still its intuitiveness and resemblance to variational free energy in approximate Bayesian inference. In this work, we take a bottom up approach and ask: taking EFE as given, what's the resulting agent's optimality gap compared with a reward-driven reinforcement learning (RL) agent, which is well understood? By casting EFE under a particular class of belief MDP and using analysis tools from RL theory, we show that EFE approximates the Bayes optimal RL policy via information value. We discuss the implications for objective specification of active inference agents.
A Federated Online Restless Bandit Framework for Cooperative Resource Allocation
Tong, Jingwen, Li, Xinran, Fu, Liqun, Zhang, Jun, Letaief, Khaled B.
Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known prior, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In this paper, we study the cooperative resource allocation problem with unknown system dynamics of MRPs. This problem can be modeled as a multi-agent online RMAB problem, where multiple agents collaboratively learn the system dynamics while maximizing their accumulated rewards. We devise a federated online RMAB framework to mitigate the communication overhead and data privacy issue by adopting the federated learning paradigm. Based on this framework, we put forth a Federated Thompson Sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem. The FedTSWI algorithm enjoys a high communication and computation efficiency, and a privacy guarantee. Moreover, we derive a regret upper bound for the FedTSWI algorithm. Finally, we demonstrate the effectiveness of the proposed algorithm on the case of online multi-user multi-channel access. Numerical results show that the proposed algorithm achieves a fast convergence rate of $\mathcal{O}(\sqrt{T\log(T)})$ and better performance compared with baselines. More importantly, its sample complexity decreases with the number of agents.
Learning Explainable and Better Performing Representations of POMDP Strategies
Bork, Alexander, Chakraborty, Debraj, Grover, Kush, Kretinsky, Jan, Mohr, Stefanie
Strategies for partially observable Markov decision processes (POMDP) typically require memory. One way to represent this memory is via automata. We present a method to learn an automaton representation of a strategy using a modification of the L*-algorithm. Compared to the tabular representation of a strategy, the resulting automaton is dramatically smaller and thus also more explainable. Moreover, in the learning process, our heuristics may even improve the strategy's performance. In contrast to approaches that synthesize an automaton directly from the POMDP thereby solving it, our approach is incomparably more scalable.
Intermittently Observable Markov Decision Processes
Chen, Gongpu, Liew, Soung-Chang
This paper investigates MDPs with intermittent state information. We consider a scenario where the controller perceives the state information of the process via an unreliable communication channel. The transmissions of state information over the whole time horizon are modeled as a Bernoulli lossy process. Hence, the problem is finding an optimal policy for selecting actions in the presence of state information losses. We first formulate the problem as a belief MDP to establish structural results. The effect of state information losses on the expected total discounted reward is studied systematically. Then, we reformulate the problem as a tree MDP whose state space is organized in a tree structure. Two finite-state approximations to the tree MDP are developed to find near-optimal policies efficiently. Finally, we put forth a nested value iteration algorithm for the finite-state approximations, which is proved to be faster than standard value iteration. Numerical results demonstrate the effectiveness of our methods.
Convergence of Finite Memory Q-Learning for POMDPs and Near Optimality of Learned Policies under Filter Stability
Kara, Ali Devran, Yuksel, Serdar
In this paper, for POMDPs, we provide the convergence of a Q learning algorithm for control policies using a finite history of past observations and control actions, and, consequentially, we establish near optimality of such limit Q functions under explicit filter stability conditions. We present explicit error bounds relating the approximation error to the length of the finite history window. We establish the convergence of such Q-learning iterations under mild ergodicity assumptions on the state process during the exploration phase. We further show that the limit fixed point equation gives an optimal solution for an approximate belief-MDP. We then provide bounds on the performance of the policy obtained using the limit Q values compared to the performance of the optimal policy for the POMDP, where we also present explicit conditions using recent results on filter stability in controlled POMDPs. While there exist many experimental results, (i) the rigorous asymptotic convergence (to an approximate MDP value function) for such finite-memory Q-learning algorithms, and (ii) the near optimality with an explicit rate of convergence (in the memory size) are results that are new to the literature, to our knowledge.
Under-Approximating Expected Total Rewards in POMDPs
Bork, Alexander, Katoen, Joost-Pieter, Quatmann, Tim
We consider the problem: is the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP) below a given threshold? We tackle this -- generally undecidable -- problem by computing under-approximations on these total expected rewards. This is done by abstracting finite unfoldings of the infinite belief MDP of the POMDP. The key issue is to find a suitable under-approximation of the value function. We provide two techniques: a simple (cut-off) technique that uses a good policy on the POMDP, and a more advanced technique (belief clipping) that uses minimal shifts of probabilities between beliefs. We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well while providing tight lower bounds on the expected total reward.
Monte Carlo Information-Oriented Planning
Thomas, Vincent, Hutin, Gérémy, Buffet, Olivier
In this article, we discuss how to solve information-gathering problems expressed as rho-POMDPs, an extension of Partially Observable Markov Decision Processes (POMDPs) whose reward rho depends on the belief state. Point-based approaches used for solving POMDPs have been extended to solving rho-POMDPs as belief MDPs when its reward rho is convex in B or when it is Lipschitz-continuous. In the present paper, we build on the POMCP algorithm to propose a Monte Carlo Tree Search for rho-POMDPs, aiming for an efficient on-line planner which can be used for any rho function. Adaptations are required due to the belief-dependent rewards to (i) propagate more than one state at a time, and (ii) prevent biases in value estimates. An asymptotic convergence proof to epsilon-optimal values is given when rho is continuous. Experiments are conducted to analyze the algorithms at hand and show that they outperform myopic approaches.